60 research outputs found

    BagMinHash - Minwise Hashing Algorithm for Weighted Sets

    Full text link
    Minwise hashing has become a standard tool to calculate signatures which allow direct estimation of Jaccard similarities. While very efficient algorithms already exist for the unweighted case, the calculation of signatures for weighted sets is still a time consuming task. BagMinHash is a new algorithm that can be orders of magnitude faster than current state of the art without any particular restrictions or assumptions on weights or data dimensionality. Applied to the special case of unweighted sets, it represents the first efficient algorithm producing independent signature components. A series of tests finally verifies the new algorithm and also reveals limitations of other approaches published in the recent past.Comment: 10 pages, KDD 201

    GEMRec: A graph-based emotion-aware music recommendation approach

    Full text link
    © Springer International Publishing AG 2016. Music recommendation has gained substantial attention in recent times. As one of the most important context features,user emotion has great potential to improve recommendations,but this has not yet been sufficiently explored due to the difficulty of emotion acquisition and incorporation. This paper proposes a graph-based emotion-aware music recommendation approach (GEMRec) by simultaneously taking a user’s music listening history and emotion into consideration. The proposed approach models the relations between user,music,and emotion as a three-element tuple (user,music,emotion),upon which an Emotion Aware Graph (EAG) is built,and then a relevance propagation algorithm based on random walk is devised to rank the relevance of music items for recommendation. Evaluation experiments are conducted based on a real dataset collected from a Chinese microblog service in comparison to baselines. The results show that the emotional context from a user’s microblogs contributes to improving the performance of music recommendation in terms of hitrate,precision,recall,and F1 score

    Information Discovery on Electronic Health Records Using Authority Flow Techniques

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>As the use of electronic health records (EHRs) becomes more widespread, so does the need to search and provide effective information discovery within them. Querying by keyword has emerged as one of the most effective paradigms for searching. Most work in this area is based on traditional Information Retrieval (IR) techniques, where each document is compared individually against the query. We compare the effectiveness of two fundamentally different techniques for keyword search of EHRs.</p> <p>Methods</p> <p>We built two ranking systems. The traditional BM25 system exploits the EHRs' content without regard to association among entities within. The Clinical ObjectRank (CO) system exploits the entities' associations in EHRs using an authority-flow algorithm to discover the most relevant entities. BM25 and CO were deployed on an EHR dataset of the cardiovascular division of Miami Children's Hospital. Using sequences of keywords as queries, sensitivity and specificity were measured by two physicians for a set of 11 queries related to congenital cardiac disease.</p> <p>Results</p> <p>Our pilot evaluation showed that CO outperforms BM25 in terms of sensitivity (65% vs. 38%) by 71% on average, while maintaining the specificity (64% vs. 61%). The evaluation was done by two physicians.</p> <p>Conclusions</p> <p>Authority-flow techniques can greatly improve the detection of relevant information in EHRs and hence deserve further study.</p

    Streaming histogram sketching for rapid microbiome analytics

    Get PDF
    Background: The growth in publically available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching and classification of microbiome samples in near real time. Results: We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using the pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we use a ‘real life’ example to show that histosketches can train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a random forest classifier that could accurately predict whether the neonate had received antibiotic treatment (97% accuracy, 96% precision) and could subsequently be used to classify microbiome data streams in less than 3 s. Conclusions: Our method offers a new approach to rapidly process microbiome data streams, allowing samples to be rapidly clustered, indexed and classified. We also provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 s on a standard laptop using four cores, with the sketch occupying 3000 bytes of disk space

    Markov Chain Ontology Analysis (MCOA)

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Biomedical ontologies have become an increasingly critical lens through which researchers analyze the genomic, clinical and bibliographic data that fuels scientific research. Of particular relevance are methods, such as enrichment analysis, that quantify the importance of ontology classes relative to a collection of domain data. Current analytical techniques, however, remain limited in their ability to handle many important types of structural complexity encountered in real biological systems including class overlaps, continuously valued data, inter-instance relationships, non-hierarchical relationships between classes, semantic distance and sparse data.</p> <p>Results</p> <p>In this paper, we describe a methodology called Markov Chain Ontology Analysis (MCOA) and illustrate its use through a MCOA-based enrichment analysis application based on a generative model of gene activation. MCOA models the classes in an ontology, the instances from an associated dataset and all directional inter-class, class-to-instance and inter-instance relationships as a single finite ergodic Markov chain. The adjusted transition probability matrix for this Markov chain enables the calculation of eigenvector values that quantify the importance of each ontology class relative to other classes and the associated data set members. On both controlled Gene Ontology (GO) data sets created with Escherichia coli, Drosophila melanogaster and Homo sapiens annotations and real gene expression data extracted from the Gene Expression Omnibus (GEO), the MCOA enrichment analysis approach provides the best performance of comparable state-of-the-art methods.</p> <p>Conclusion</p> <p>A methodology based on Markov chain models and network analytic metrics can help detect the relevant signal within large, highly interdependent and noisy data sets and, for applications such as enrichment analysis, has been shown to generate superior performance on both real and simulated data relative to existing state-of-the-art approaches.</p

    The gene normalization task in BioCreative III

    Get PDF
    BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance

    Evaluation of appendicitis risk prediction models in adults with suspected appendicitis

    Get PDF
    Background Appendicitis is the most common general surgical emergency worldwide, but its diagnosis remains challenging. The aim of this study was to determine whether existing risk prediction models can reliably identify patients presenting to hospital in the UK with acute right iliac fossa (RIF) pain who are at low risk of appendicitis. Methods A systematic search was completed to identify all existing appendicitis risk prediction models. Models were validated using UK data from an international prospective cohort study that captured consecutive patients aged 16–45 years presenting to hospital with acute RIF in March to June 2017. The main outcome was best achievable model specificity (proportion of patients who did not have appendicitis correctly classified as low risk) whilst maintaining a failure rate below 5 per cent (proportion of patients identified as low risk who actually had appendicitis). Results Some 5345 patients across 154 UK hospitals were identified, of which two‐thirds (3613 of 5345, 67·6 per cent) were women. Women were more than twice as likely to undergo surgery with removal of a histologically normal appendix (272 of 964, 28·2 per cent) than men (120 of 993, 12·1 per cent) (relative risk 2·33, 95 per cent c.i. 1·92 to 2·84; P < 0·001). Of 15 validated risk prediction models, the Adult Appendicitis Score performed best (cut‐off score 8 or less, specificity 63·1 per cent, failure rate 3·7 per cent). The Appendicitis Inflammatory Response Score performed best for men (cut‐off score 2 or less, specificity 24·7 per cent, failure rate 2·4 per cent). Conclusion Women in the UK had a disproportionate risk of admission without surgical intervention and had high rates of normal appendicectomy. Risk prediction models to support shared decision‐making by identifying adults in the UK at low risk of appendicitis were identified

    Integrating Semantic Tagging with Popularity-Based Page Rank for Next Page Prediction

    No full text

    PageRank Algorithm

    No full text
    corecore